Work your way through the textbook lab 6.5.1 Best Subset Selection, and the forward stepwise procedure in 6.5.2, then answer these questions.
library(knitr)
library(tidyverse)
library(ISLR)
library(skimr)
library(leaps)
library(patchwork)
library(rsample)
library(tidymodels)
library(glmnet)
data(Hitters)
# Take a look at the data
skim(Hitters)
# Handle the missing values on Salary,
# by removing them
Hitters <- Hitters %>%
filter(!is.na(Salary))
# Subset selection
regfit.full <- regsubsets(Salary~., Hitters)
summary(regfit.full)
The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. By default it only examines models with up to 8 variables. Which variables make up the best 8-variable model?
coef(regfit.full, 8)
# (Intercept) AtBat Hits Walks CHmRun CRuns
# 130.9691577 -2.1731903 7.3582935 6.0037597 1.2339718 0.9651349
# CWalks DivisionW PutOuts
# -0.8323788 -117.9657795 0.2908431
Now rerun the selection with nvmax=19, so that all 19 predictors can enter. Plot the model fit diagnostics for the best model of each size. What would these diagnostics suggest about an appropriate choice of model? Do your results compare with the textbook results? Why or why not?
regfit.full <- regsubsets(Salary~., data=Hitters, nvmax=19)
reg.summary <- summary(regfit.full)
#names(reg.summary)
models <- tibble(nvars=1:19, rsq=reg.summary$rsq,
rss=reg.summary$rss,
adjr2=reg.summary$adjr2,
cp=reg.summary$cp,
bic=reg.summary$bic)
p1 <- ggplot(models, aes(x=nvars, y=rsq)) + geom_line()
p2 <- ggplot(models, aes(x=nvars, y=rss)) + geom_line()
p3 <- ggplot(models, aes(x=nvars, y=adjr2)) + geom_line()
p4 <- ggplot(models, aes(x=nvars, y=cp)) + geom_line()
p5 <- ggplot(models, aes(x=nvars, y=bic)) + geom_line()
p1 + p2 + p3 + p4 + p5
BIC suggests 6 variables (it gets worse after 6, improves briefly at 8, then worsens again). The other criteria suggest around 10 or 11. The textbook also arrives at 6 variables via BIC, so the results here are similar.
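The visual reading of the curves can be checked numerically with which.min()/which.max() on the summary statistics. A standalone sketch (it rebuilds the same full-data fit as above so it runs on its own):

```r
library(ISLR)
library(leaps)

# Rebuild the full-data fit so this chunk runs standalone
Hitters <- na.omit(Hitters)
reg.summary <- summary(regsubsets(Salary ~ ., data = Hitters, nvmax = 19))

# Model size chosen by each criterion
which.max(reg.summary$adjr2)  # adjusted R^2 peaks at 11 variables
which.min(reg.summary$cp)     # Cp is minimised at 10
which.min(reg.summary$bic)    # BIC is minimised at 6
```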
regfit.fwd <- regsubsets(Salary~., data=Hitters, nvmax=19, method="forward")
#summary(regfit.fwd)
d <- tibble(nvars=0:19,
rss=regfit.fwd$rss)
ggplot(d, aes(x=nvars, y=rss)) + geom_line()
Full model
coef(regfit.full, 6)
# (Intercept) AtBat Hits Walks CRBI DivisionW
# 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
# PutOuts
# 0.2643076
Forward selection
coef(regfit.fwd, 6)
# (Intercept) AtBat Hits Walks CRBI DivisionW
# 91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
# PutOuts
# 0.2643076
We are using RSS because this is what the forward stepwise procedure returns. Look for decreasing values and for where the curve flattens out; this suggests around 6 or 10 variables.
The models with 6 predictors are identical.
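The claim that the size-6 models agree, while larger models need not, can be checked programmatically rather than by eye. A standalone sketch:

```r
library(ISLR)
library(leaps)

Hitters <- na.omit(Hitters)
regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
regfit.fwd  <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")

# Same variable set at size 6 (and hence the same OLS fit)
identical(names(coef(regfit.full, 6)), names(coef(regfit.fwd, 6)))  # TRUE
# The two procedures first disagree at size 7
setdiff(names(coef(regfit.full, 7)), names(coef(regfit.fwd, 7)))
```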
set.seed(1)
split <- initial_split(Hitters, 2/3)
h_tr <- training(split)
h_ts <- testing(split)
regfit.best <- regsubsets(Salary~., data=h_tr, nvmax=19)
test.mat <- model.matrix(Salary~., data=h_ts)
val.errors <- rep(NA,19)
for (i in 1:19) {
coefi <- coef(regfit.best, id=i)
pred <- test.mat[,names(coefi)]%*%coefi
val.errors[i] <- mean((h_ts$Salary-pred)^2)
}
val.errors
# [1] 106172.08 81664.60 90858.72 81614.97 77020.89 80218.62 76664.07
# [8] 72035.94 72137.70 70979.12 71767.23 71812.66 72429.40 71887.34
# [15] 71435.67 71834.43 71908.20 71717.66 71769.64
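The matrix multiplication inside the loop above can be wrapped into a reusable predict method; this is the predict.regsubsets() helper from the textbook (Lab 6.5.3), sketched standalone with a small usage example:

```r
library(ISLR)
library(leaps)

Hitters <- na.omit(Hitters)

# S3 predict method for regsubsets objects, as in the textbook
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])  # formula used in the original fit
  mat   <- model.matrix(form, newdata)   # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi          # predictions
}

fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
head(predict(fit, Hitters, id = 6))
```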
d2 <- tibble(nvars=1:19,
err = val.errors)
ggplot(d2, aes(x=nvars, y=err)) + geom_line()
coef(regfit.best, which.min(val.errors))
# (Intercept) AtBat Hits Walks CAtBat CRuns
# 223.2613709 -2.8895562 8.8700075 6.0558531 -0.1430211 1.5174180
# CRBI CWalks DivisionW PutOuts Assists
# 0.7689911 -0.9155342 -94.1445405 0.2777674 0.3240994
coef(regfit.full, 10)
# (Intercept) AtBat Hits Walks CAtBat CRuns
# 162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
# CRBI CWalks DivisionW PutOuts Assists
# 0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
Model selection on the validation set chooses a different model, with 10 predictors, most likely because only a subset of the data was used to fit the candidate models. Note also that we used test error, rather than RSS, as the measure for choosing between models, and a different metric can produce a different result. (Interestingly, if a different random seed is used, you might get a different best model. In that case, select the variables that are consistently chosen across seeds.)
The selected variables are the same as in the 10-variable model fitted to the full data set; the fitted coefficients differ a little.
It is said that 10-fold cross-validation is a reasonable choice for dividing the data. What size hold-out sets would this create for this data? Argue whether this is good or bad.
There isn't a lot of data. With 10-fold cross-validation only about 20 cases are held out each time, which leads to substantial variability between the predictions from each fold. I would suggest using 5-fold instead. (Note from running CV: when I ran this it also produced results with considerable variability. The choice of the number of variables is consistent for most \(k\), even though the variability in error is substantial.)
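A quick back-of-the-envelope check of the fold sizes (263 complete cases after dropping missing salaries; roughly 175 of them in the 2/3 training split):

```r
n_full  <- 263                  # Hitters rows after dropping missing Salary
n_train <- round(n_full * 2/3)  # ~175 cases in the training split

# Held-out cases per fold under different schemes
c(ten_fold_full   = n_full / 10,
  ten_fold_train  = n_train / 10,
  five_fold_train = n_train / 5)
```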
Here we will use lasso to fit a regularised regression model and compare the results with the best subset model.
min(val.errors)
# [1] 70979.12
coef(regfit.best, which.min(val.errors))
# (Intercept) AtBat Hits Walks CAtBat CRuns
# 223.2613709 -2.8895562 8.8700075 6.0558531 -0.1430211 1.5174180
# CRBI CWalks DivisionW PutOuts Assists
# 0.7689911 -0.9155342 -94.1445405 0.2777674 0.3240994
grid <- 10^seq(10,-2,length=100)
x <- model.matrix(Salary~., h_tr)[,-1]
y <- h_tr$Salary
lasso.mod <- glmnet(x, y, alpha=1, lambda=grid)
# lasso.mod$beta is a 19(p) x 100(nlambda) sparse matrix;
# the as.matrix function converts the sparse format into a dense one
coeffs <- as.matrix(lasso.mod$beta)
coeffs <- cbind(var=rownames(coeffs), coeffs)
cv <- coeffs %>% as_tibble() %>%
gather(nval, coeffs, -var) %>%
mutate(coeffs=as.numeric(coeffs)) %>%
mutate(lambda=rep(lasso.mod$lambda, rep(19, 100)))
p <- ggplot(cv, aes(x=lambda, y=coeffs, group=var, label=var)) + geom_line() +
scale_x_log10(limits=c(0.01, 100)) # lambda grid starts at 0.01; a negative limit is invalid on a log scale
plotly::ggplotly(p)
# This is how the sample code plots
#plot(lasso.mod, xvar="lambda", xlim=c(-1, 5))
As seen from the few lines that are not near zero, there are just a few predictors that are important for predicting salary.
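The impression from the plot, that only a few coefficients stay away from zero at moderate penalties, can be quantified by counting non-zero coefficients along the path. A standalone sketch on the full data set (the analysis above uses the training split, so exact counts may differ):

```r
library(ISLR)
library(glmnet)

Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

grid <- 10^seq(10, -2, length = 100)
fit <- glmnet(x, y, alpha = 1, lambda = grid)

# Non-zero coefficients at each lambda: none survive a huge penalty,
# and the count approaches the full 19 predictors as lambda shrinks
nz <- colSums(as.matrix(fit$beta) != 0)
range(nz)
```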
# Do cross-validation, using glmnet's function
set.seed(1)
cv.out <- cv.glmnet(x, y, alpha=1)
#cv.df <- tibble(lambda=cv.out$lambda, mse=cv.out$cvm,
# mse_l=cv.out$cvlo, mse_h=cv.out$cvup)
#ggplot(cv.df, aes(x=lambda, y=mse)) + geom_point() +
# scale_x_log10() +
# geom_errorbar(aes(ymin=mse_l, ymax=mse_h))
# This is how the sample code plots
plot(cv.out)
# Fit a single model to the best lambda, predict the test set, and compute mse
bestlam <- cv.out$lambda.min
bestlam
# [1] 3.72917
x_ts <- model.matrix(Salary~., h_ts)[,-1]
y_ts <- h_ts$Salary
lasso.pred <- predict(lasso.mod, s=bestlam, newx=x_ts)
y.test <- y_ts
mean((lasso.pred-y.test)^2)
# [1] 67452.34
# Fit the model
# alpha=1 means lasso
# Note: glmnet computes its own lambda sequence by default, so
# supplying the grid is optional; we pass it to match the earlier fit.
# The predictions for the best lambda are then made with the predict fn
out <- glmnet(x, y, alpha=1, lambda=grid)
lasso.coef <- predict(out, type="coefficients", s=bestlam)[1:20,]
lasso.coef
# (Intercept) AtBat Hits HmRun Runs RBI
# 190.7549874 -2.1348701 7.1951191 0.0000000 0.0000000 0.3933491
# Walks Years CAtBat CHits CHmRun CRuns
# 4.7012516 -10.4653138 0.0000000 0.0000000 0.1738710 0.6559825
# CRBI CWalks LeagueN DivisionW PutOuts Assists
# 0.4173751 -0.5238019 0.0000000 -101.2644847 0.2435969 0.1431246
# Errors NewLeagueN
# -1.2064392 0.0000000
lasso.coef[lasso.coef!=0]
# (Intercept) AtBat Hits RBI Walks Years
# 190.7549874 -2.1348701 7.1951191 0.3933491 4.7012516 -10.4653138
# CHmRun CRuns CRBI CWalks DivisionW PutOuts
# 0.1738710 0.6559825 0.4173751 -0.5238019 -101.2644847 0.2435969
# Assists Errors
# 0.1431246 -1.2064392
With the seed provided there are 13 non-zero coefficients (excluding the intercept), and a test MSE of 67452.34.
The lasso actually performs a little better than best subsets here: its test MSE (67452) is lower than the best-subsets validation MSE (70979). The lasso model retains slightly more variables (13 versus 10), so in this case it is not simpler to interpret, although it still drops several predictors relative to the full model.
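Using the two test-set MSEs printed above, the difference is modest, on the order of 5%:

```r
best_subset_mse <- 70979.12  # min(val.errors) from above
lasso_mse       <- 67452.34  # lasso test MSE from above

# Relative improvement of the lasso over best subsets
(best_subset_mse - lasso_mse) / best_subset_mse  # ~0.05
```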
Only one variable is very important for the model; which is it? (It occurs in every list of variables returned as important, and has the largest coefficient in the lasso model.) Several more variables frequently appear as important; which ones are these? Several others appear occasionally in some models but always have very small coefficients. Can you name one of these? What does this tell you about the relationship between Salary and the predictors?
DivisionW is by far the most important variable. It is always selected, and always has a large coefficient.
AtBat, Hits and Walks appear persistently, with relatively small coefficients, and are also retained by the lasso.
PutOuts, CRBI and Assists appear regularly, with very small coefficients.
In the lasso model it is interesting that NewLeagueN initially appears to be important, but it is the variable whose coefficient is shrunk to 0 most quickly, and it never shows up in any of the subset selection models. Years also makes an appearance in the lasso, with a non-zero coefficient in the final model, which differs from all the subset selection models.
There is only one major variable useful for predicting salary, which is DivisionW. This variable alone provides most of the predictive power. Small gains are obtained by adding additional variables.